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Abstract 

Approximate Bayesian computation (ABC) is a powerful and elegant framework for performing inference in 
simulation-based models. However, due to the difficulty in scaling likelihood estimates, ABC remains useful 
for relatively low-dimensional problems. We introduce Hamiltonian ABC (HABC), a set of likelihood-free 
algorithms that apply recent advances in scaling Bayesian learning using Hamiltonian Monte Carlo (HMC) and 
stochastic gradients. We find that a small number forward simulations can effectively approximate the ABC 
gradient, allowing Hamiltonian dynamics to efficiently traverse parameter spaces. We also describe a new 
simple yet general approach of incorporating random seeds into the state of the Markov chain, further reducing 
the random walk behavior of HABC. We demonstrate HABC on several typical ABC problems, and show that 
HABC samples comparably to regular Bayesian inference using true gradients on a high-dimensional problem 
from machine learning. 


1. INTRODUCTION 

In simulation-based science, models are defined by a simulator and its parameters. These are called likelihood- 
free models because, in contrast to probabilistic models, their likelihoods are either intractable to compute 
or must be approximated by simulations. To perform inference in likelihood-free models, a broad class of 
algorithms called Approximate Bayesian Computation (Beaumont et ak, 2002; Marjoram et ak, 2003; Sisson 
et ak, 2007; Sisson & Fan, 2010; Marin et ak, 2012; Fan et ak, 2013) are employed. 

At the core of every ABC algorithm is simulation. To evaluate the quality of a parameter vector 6, a 
simulation is run using 6 as inputs and producing outputs x. If the pseudo-data x is “close” to observations 
y, then 6 is kept as a sample from the approximate posterior. Parameters 9 are then adjusted, depending upon 
the algorithm, to obtain the next sample. 

In ABC, there is a fundamental trade-off between the computation required to obtain independent samples 
and the approximation to the true posterior. If the parameter measuring closeness is too small, then samplers 
“mix” poorly; on the other hand, if it is too large, then the approximation is poor. As the dimension of 
the parameters grows, the problem worsens, just as it does for general Bayesian inference with probabilistic 
models, but it is more acute for ABC due to its simulation requirement. There is therefore a deep interest in 
improving the efficiency of ABC samplers (in terms of computation per independent sample). In this paper 
we address this issue directly by using Hamiltonian dynamics to approximately sample from likelihood-free 
models with high-dimensional parameters. 

Hamiltonian Monte Carlo (HMC) (Duane et ak, 1987; Neal, 2011) is perhaps the only Bayesian inference 
algorithm that scales to high-dimensional parameter spaces. The core computation of HMC is the gradient 
of the log-likelihood. Two problems arise if we consider HMC for ABC: one, how can the gradients be 
computed for high-dimensional likelihood-free models, and two, given a stochastic approximation to the 
gradient, can a valid HMC algorithm be derived? 
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To answer the latter, we turn to recent developments in scaling Bayesian inference using HMC and 
stochastic gradients (Welling & Teh, 2011; Chen et ah, 2014; Ding et ah, 2014). We call these stochastic 
gradient Hamiltonian dynamics (SGHD) algorithms. SGHD are computationally efficient for two reasons. 
First, they avoid computing the gradient of the log-likelihood over the entire data set, instead approximating 
it using small batches of data, i.e. computing stochastic gradients. Second, they can maintain reasonable ap¬ 
proximations to the Hamiltonian dynamics and therefore avoid a Metropolis-Hastings correction step involv¬ 
ing the full data set. Different strategies are employed to do this: small step-sizes combined with Langevin 
dynamics (Welling & Teh, 2011), using friction to prevent accumulation of errors in the Hamiltonian (Chen 
et ah, 2014), and using a thermostat to control the temperature of the Hamiltonian (Ding et ah, 2014). Each 
of these strategies can be used by HABC. 

In HABC, we use forward simulations to approximate the likelihood-free gradient. The key difference be¬ 
tween SGHD methods and HABC is that the stochasticity of the gradient does not come from approximating 
the full data gradient with a mini-batch gradient, but by the stochasticity of the simulator. It is therefore not 
the expense of the simulator (though this could very well be the case for many interesting simulation-based 
models - see Section 7) that requires an approximation to the gradient, but the likelihood-free nature of the 
problem. 

There are several difficulties in estimating gradients of likelihood-free models that we address with 
HABC. The first is due to the form of the ABC log-likelihood. As we show in Section 2, using a condi¬ 
tional model for 7r(x|0) provides an estimate of the ABC likelihood that is less sensitive to e and therefore 
is more conducive to stochastic gradient computations. The second difficulty is that for high-dimensional 
parameter spaces, computing the gradients naively (i.e. by finite differences (Kiefer et ah, 1952)) can squash 
the gains brought by the Hamiltonian dynamics. Fortunately, we can use existing stochastic approximation 
algorithms (Spall, 1992, 2000) that can be used to compute unbiased estimators of the gradient with a small 
number of forward simulations that is independent of the parameter dimension. The stochastic perturbation 
stochastic approximation (SPSA) (Spall, 1992) is described in Section 4 

A further innovation of this paper is the use of common random numbers (CRN) to improve the efficiency 
of the Hamiltonian dynamics. The idea behind CRNs is to use the same set of random seeds for estimating a 
gradient by FD or SPSA, i.e. when simulating 7r(x|0 -f d9) and 7r(x|0 — d9) use the same random seeds. This 
was applied successfully to SPSA (Kleinman et ah, 1999) (and is analogous to using the same mini-batch in 
stochastic gradient methods). We extend and simplify this approach by including the random seeds oj into the 
state of the Markov chain; by keeping the random seeds fixed for several consecutive steps, the second order 
gradient stochasticity is greatly reduced. We show that doing this produces a valid MCMC procedure. This 
approach is not exclusive to HABC; our experiments show it also helps random-walk ABC-MCMC. 

We briefly review ABC in Section 2. In Section 3 we review three approaches to stochastic gradient 
inference using Hamiltonian dynamics: SGFD, SGHMC, and SGNHT. We then introduce Hamiltonian ABC 
in Section 4, where we will show how to improve the stability of the gradient estimates by using CRNs 
and local density estimators of the simulator. Extensions to high-dimensional parameter spaces are also 
discussed. In Section 5 we show how HABC behaves on a simple one-dimensional problem, then in Section 6 
we compare HABC with ABC-MCMC for two problems: a low-dimensional model of chaotic population 
dynamics and a high-dimensional problem. 

2. APPROXIMATE BAYESIAN COMPUTATION 

Consider the Bayesian inference task of either drawing samples from or learning an approximate model of 
the following (usually intractable) posterior distribution: 

7r(0|yi,---,yAr) OC 7r(0)7r(yi,...,yjv|6>) (1) 

where 7r(0) is a prior distribution over parameters 9 G IR^ and 7r(yi,..., ynIS) is the likelihood of N data 
observations, where y^ G H'^. In ABC, the vector of J observations are typically informative statistics of the 
raw observations. It can be shown that if the statistics used in the likelihood function are sufficient, then these 
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algorithms sample correctly from an approximation to the true posterior (Marin et ah, 2012). The simulator 
is treated as generator of random pseudo-observations, i.e. x 7r(x|0) is a draw from the simulator. 
Discrepancies between the simulator outputs x and the observations y are scaled by a closeness parameter e 
and treated as likelihoods. This is the equivalent to putting an e-kernel around the observations, and using a 
Monte Carlo estimate of the likelihood using S draws of x; 

g 

T^eivlO) = J 7re(y|x)7r(x|0)dx « ^^7re(y|x(^)) (2) 

In ABC Markov chain Monte Carlo (MCMC) (Marjoram et ah, 2003; Wilkinson, 2013; Sisson & Fan, 
2010) the Metropolis-Hastings (MH) proposal distribution is composed of the product of the proposal for the 
parameters 9 and the proposal for the simulator outputs: 

q{e', x(i)' ,..., x(«)>) = qie'\e) [] 7r(x(^)>') (3) 


Using this form of the proposal distribution, and using the Monte Carlo approximation eq 2, we arrive at the 
following Metropolis-Hastings accept-reject probability. 


a = min 


V ’ '^WT,Li'^eiy\x)q{e'\e) j 


(4) 


If the simulations are part of the Markov chain, the algorithm corresponds to the pseudo-marginal (PM) 
sampler (Andrieu & Roberts, 2009), otherwise it is a marginal sampler (Marjoram et ah, 2003; Sisson & Fan, 
2010). For this paper we will be interested in the PM sampler because this is equivalent to having the random 
states that generated the simulation outputs in the state of the Markov chain, which we will use within a valid 
ABC sampling algorithm in Section 4. 

An alternative approach to computing the ABC likelihood is to estimate the parameters of a conditional 
model 7r(x|0), e.g. using kernel density estimate (Turner & Sederberg, 2014) or a Gaussian model (Wood, 
2010). While either approach should be adequate and both have their own limits and advantages, for this 

paper we will use a Gaussian model. In ABC, using a conditional Gaussian model for 7r(x|0) is called a 

synthetic likelihood (SL) model (Wood, 2010). For a SL log-likelihood model, we compute estimators of the 
first and second moments of 7r(x|0). The advantage is that for a Gaussian e-kernel, we can convolve the two 
densities 

T^e{y\0) = j N{y\^,e^)N{yi\ne,al)dx (5) 

= A/'(y|/ie,CTe-f e^) (6) 


Of particular concern to this paper is the behavior of the log-likelihoods for different values of e. In the 
e-kernel case, the log-likelihood is very sensitive to small values of e: 


log7r,(y|0) = log^AA(y|x(*),e2) 


(7) 


= logAA(y|x(*),e2)+log(l-f i/) 


-loge-^(y-x(-))2 


(8) 

(9) 


where m is the simulation that is closest to y, iT is a sum over terms close to 0. We can see that the log- 
likelihood can be set arbitrarily small by decreasing e. On the other hand, by using a model of the simulation 
at 6 


log7re(y|6/) 


-^log(o-e + e^) 


2{al + e^) 


( 10 ) 
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For the SL model, e acts as a smoothing term and can be set to small values with little change to the log- 
likelihood, as long as the SL estimators are fit appropriately. This insensitivity to e will be used in Section 4 
for estimating gradients of the ABC likelihood. Before describing HABC in full detail however, we now ex¬ 
plain how scaling Hamiltonian dynamics in Bayesian learning can be accomplished using stochastic gradients 
from batched data. 

3. SCALING BAYESIAN INFERENCE USING HAMILTONIAN DYNAMICS 

Scaling Bayesian inference algorithms to massive datasets is necessary for their continuing relevance in the 
so-called big data era. We now review the role stochastic gradient methods combined with Hamiltonian 
dynamics have played in recent advances in scaling Bayesian inference. Most importantly, these methods 
have combined the ability of HMC to explore high-dimensional parameter spaces with the computational 
efficiency of using stochastic gradients based on small mini-batches of the full dataset. After an overview 
of HMC, we will briefly describe stochastic gradient Hamiltonian dynamics (SGHDs), starting with using 
Langevin dynamics (Welling & Teh, 2011), then HMC with friction (Chen et ak, 2014), and finally HMC 
with thermostats (Ding et ak, 2014). We will then make the connection between SGHDs and HABC in 
Section 4. 

3.1 Hamiltonian Monte Carlo 

Hamiltonian dynamics are often necessary to adequately explore the target distribution of high-dimensional 
parameter spaces. By proposing parameters that are far from the current location and yet have high acceptance 
probability, Hamiltonian Monte Carlo (Duane et ak, 1987; Neal, 2011) can efficiently avoid random walk 
behavior that can render proposals in high-dimensions painfully slow to mix. 

HMC simulates the trajectory of a particle along a frictionless surface, using random initial momentum p 
and position 6. The Hamiltonian function computes the energy of the system and the dynamics govern how 
the momentum and position change over time. The continuous Hamiltonian dynamics can be simulated by 
discretizing time into small steps rj. If p is small, the value of 6 at the end of a simulation can be used as 
proposals within the Metropolis-Hastings algorithm. Hamiltonian dynamics should propose 6 that are always 
accepted, but errors due to discretization may require a Metropolis-Hastings correction. It is this correction 
step that SGHD algorithms want to avoid as it requires computing the log-likelihood over the full data set. 

More formally, the Hamiltonian H {9, p) = U{9) -\- K{p) is a function of the current potential energy 
U{9) and kinetic energy K{p) = p’^M~^p/2 (M is a diagonal matrix of masses which for presentation are 
set to 1). The potential energy is defined by the negative log joint density of the data and prior: 

N 

C/(0) = -log7r(0)-^log7r(y,|0) (11) 

i=l 

The Hamiltonian dynamics follow 

de = pdt dp = -VU{e)dt ( 12 ) 


in simulation dt = rj. 

3.2 Stochastic Gradient Hamiltonian Dynamics 

If the log-likelihood over the full data set is replaced with a mini-batch estimate, as is done for the following 
stochastic gradient Hamiltonian dynamics (SGHDs) algorithms, then the error in simulating the Hamiltonian 
dynamics comes not only from the discretization, but from the variance of the stochastic gradient. As long 
as this error is controlled, either by using small steps rj (SGLD), or adding friction terms B (SGHMC), or 
using a thermostat ^ (SGNHT), the expensive MH correction step can be avoided and values of 6 from the 
Hamiltonian dynamics can be used as approximate samples from the posterior. 
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SGHDs replace the full potential energy and its gradient with a mini-batch approximation: 


/V 

U{e) = -log7r(0)-- Vlog7r(y,|0) 

n 

(13) 

Af 

VU{e) = -Vlog7r(0)-- V viog7r(y,|0) 

n 

i—ht 

(14) 

where n is the mini-batch size, and hi are indices chosen randomly without replacement from [1, N] (i.e. it 
defined a random mini-batch). 

Stochastic gradient Langevin dynamics (Welling & Teh, 2011) performs one full leap-frog step of 
HMC. Starting with a half step for the momentum, the update for 9 is 

Pt ~ AA(0,4) 

(15) 

ft+i = pt-v^Ui9t)l2 

(16) 

9t+i = 9t+r]Pt+i 

(17) 

It is not necessary to include p in the updates since there is only one step: 


9t+i = 9t + r^N{Q,Ip)-ri^VU{9t)/2 

(18) 


One of the potential drawbacks of SOLD is that the momentum term is refreshed for every update of the 
parameters, and since this means the parameter update only uses the current gradient approximation, it limits 
the benefits of using Hamiltonian dynamics. On the other hand, this also prevents SOLD from accumulating 
errors in the Hamiltonian dynamics. 

Stochastic Gradient HMC (SGHMC) (Chen et ah, 2014) avoids p refreshment altogether. By applying 
HMC directly using the stochastic approximation U and VU, which the authors call naive SGHMC, the 
variance of the gradient will introduce errors that left unaddressed will result in sampling from the incorrect 
target distribution. Under the assumption that VU {9) — VU{9) + N (0, Vg), where Ve is the covariance of 
the gradient approximation, and updates pt+i = Pt + ^Pt and 9t+i = 9t+ rjpt+i, the change in momenta 
Ap from one full step is 

-77 (V[/(0) +ff{0,Ve)) = -vVUie) + ff (0, r]^Vg) (19) 

By adding a friction term B to Ap proportional to Vg, the correction step can be avoided 


Ap = -7^Bpt-riVU{et)+N'iO,2rjB) 


( 20 ) 


where B = ^r/Vg^. In practice, since we can only estimate B by some B and can only compute C/, a user 
defined friction term C is used (with C — B is semi-positive definite). Thus the updates used for Ap for 


SGHMC: 


-r]Cpt - ilSJUiOt) + M (o, 2r7(C - B)) 


( 21 ) 


In our experiments we compute an online estimate V and set C = cip + V. 

Stochastic Gradient thermostats (SGNHT) (Ding et ah, 2014) addresses the difficulty of estimating B 
by introducing a scalar variable ^ who’s addition to the Hamiltonian dynamics maintains the temperature of 
the system constant, i.e. it acts as a (Nose-Hoover) thermostat (Leimkuhler & Reich, 2009). The update 
equations remain simple: initialize ^ = C (or c), then for t = 1... 


pt+i = pt-TiitPt-ri-^U{et)+N{0,2T^tC) (22) 

9t+i = 9t + rjpt+i (23) 

6-ti = ^t+v{pI+iPt+i/D-l) (24) 


In summary, the hyperparameters required for these algorithms are rj and C (for SGHMC and SGNHT only), 
and in practice, some way of estimating V for SGHMC. 
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4. HAMILTONIAN ABC 

The general approach of applying Hamiltonian dynamics to ABC requires choosing one of the SGHD algo¬ 
rithms and then plugging in the ABC gradient approximation VU{9). With this in mind we leave the details 
of the Hamiltonian updates to previous work (Welling & Teh, 2011; Chen et ah, 2014; Ding et ah, 2014) and 
focus on the details of how stochastic gradients are computed in the likelihood-free setting. 

4.1 Deterministic Representations of Simulations 

Implicit in each simulation run x 7r(x|0) is a sequence if internally generated random numbers that are 
used to produce random draws from 7r(x|0). These random numbers are important to HABC because we wish 
to control the stochasticity of the simulator when computing its gradient. Furthermore, we will control the 
random numbers over multiple time steps. Instead of keeping track of random numbers, we can equivalently 
keep a vector of S random seeds uj. This allows HABC to treat the simulation function 7r(x|0) as a blackbox, 
outside of which we can control the random number generator (RNG), and represent x^®^ as the output of a 
deterministic function; i.e. x*^®^ = f{9,Ws) instead of x*^®^ 7r(x|0). We include us as part of the state of 

our Markov chain. 

4.2 Kernel-e versus Synthetic-likelihood -based Gradients 

In Section 2 we showed that the synthetic-likelihood representation of Cf^{9) is less sensitive to small choices 
of 6. This is particularly important to HABC as our gradient approximations are proportional to differences 
in Ce{9)', if the variance of the stochastic gradients is too high, then we must choose a very small step-size 
ry, eliminating the usefulness of HMC for ABC. Under the deterministic representation of x^®), we can write 
the loglikelihood as 


C^{9) (X \og^N{y\f{9,ujs),e^) (25) 

S 

ss -loge-^(y-/(0,w™))2 (26) 

In the second line we have assumed e is very small and m is the index of the random seed producing the 
closest simulation to y. For a finite difference approximation, dCe{9)/89 is 

((y - f{d - d 0 ,uj-) f - (y - fi9 + de,uj+)f) (27) 

On the other hand, the synthetic-likelihood is stable; using a deterministic representation, we have 

f^e = <^9= f{0,uJs)f (28) 


the gradients (for a 1-dim problem) use e as a smoothness prior in dCe{9)/89: 

2 ^yal_+e^) 2{al,+e^)^ 2{al_+e^) 


(29) 


In Figure 2, as part of our demonstration of HABC, we compare the gradient approximations around the 
true ©map using SL versus kernel-e Figure 2 for a simple problem. We find that although, for this particular 
problem, SL has a small bias due to its Gaussian assumption, it has much smaller variance, an important 
property for HABC. 
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Algorithm 1 VU FDSA-ABC 
inputs: 0, de,/, w, £e, t 
g 0 

for r = 1 : |0| do 
A ^ 0 

Ar 1 

for s = 1 : |u;| do 

^ / (0 + dgA, Ws) 

^ 0 *) ^ J (0 _ dgA,UJs) 

end for 

Qr ^ /:e({x^|!^}) - £e({xL^^}) 

end for 

0^0/2d0 + Vlog7r(0) 

return — g 


Algorithm 2 VU SPSA-ABC 
inputs: 0, dg, /, a;, ^e, R 
g 0 

for r = 1 : i? do 

A ~ 2 •Bemouilli(l/2, |0|) - 1 
for s = 1 : |u;| do 

x^^ ^ / (0 + dgA,Uls) 

xf!^ ^ / (0 - dgA,U]s) 

end for 

g^g+(/:,({x^,*)})-/:,({xL*)})).A-i 

end for 

g 4- g/{2dgR) + V log7r(0) 

return —g 


4.3 From Finite Differences to Simultaneous Perturbations 

Algorithm 1 shows the finite dijference stochastic approximation (FDSA) (Kiefer et ah, 1952) to VU{6) 
as a function of random seeds u). Note we have deliberately shown the deterministic simulations (/) outside 
of Ce to emphasize its dependence on x. The number of simulations required for FDSA is 2SD, which may 
be acceptable for some small ABC problems. Our goal is to scale ABC to high-dimensions and for that we 
need an alternative stochastic approximation of VC/(0). 

In the gradient-free setting, Spall (Spall, 1992, 2000) provides a stochastic approximate to the true gra¬ 
dient using only 2 forward simulations for any dimension D (though the approximation can be improved 
by averaging R estimates). Spall’s simultaneous perturbation stochastic approximation (SPSA) algorithm 
works as follows. Let L be the gradient-free function we wish to optimize. Each approximation randomly 
generates a perturbation mask (our name) A of dimension D = \0\ where entry Ag ~ 2Bemouilli(l/2) — 1. 
Then L is evaluated at 0 -f dgA and 0 — dgA, giving the gradient approximation g{9) « dL{9)/d9: 


m 


L {9 + dgA) - L {9 - dgA) 

2dg 


1/Ai' 

1/A 2 

1/Ao 


(30) 
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e 


Figure 1: A view of a simulator in terms of common random numbers. The horizontal line represents y and red shading 
±2£. The shaded curved region represents 2a of 7r(x|0). The dashed lines are f{G, cOs) for several values of 
LO. The blue circles are potential random samples from 7r(xj6). For a fixed value uja, the simulator produces 
deterministic outputs that change smoothly, even though the simulator itself is quite noisy. 


If we let Qrid) be the estimate using perturbation mask A^, the estimate g{6) can be improved by averaging 
g{9) = ^/R'Yhr 9r{9)- Algorithm 2 shows SPSA to estimate WU{6). The number of simulations required 
for SPSA is 2SR, where i? > 1. 

Variations of SPSA include one-sided SPSA (Spall, 2000) (we use what Spall calls 2SPSA) and an algo¬ 
rithm for estimating the Hessian based on the same principle as SPSA (Spall, 2005). The one-sided version is 
attractive computationally, but for HABC, the updates for 0 require simulating two-sides anyway (once at 0, 
after an step, and once for the one-sided gradient), so using 2SPSA makes more sense. SPSA has also been 
used within a procedure for maximum-likelihood estimation for hidden Markov models using ABC (Ehrlich 
etal., 2013). 

4.4 Common and Sticky Random Numbers 

The usefulness of applying common random numbers (CRNS) in SPSA has been previously demonstrated 
(Kleinman et al., 1999). In that work, the same random numbers are used to simulate both sides of the opti¬ 
mization function within the SPSA gradient. This makes sense intuitively, as we would generally assume that 
the expected simulation function varies smoothly in by using CRNs, this smoothness is easily exploited 
(see Figure 1). If we were to apply SPSA to Bayesian learning, then using CRNs in the gradient step would 
be analogous to using the same mini-batch for both sides of the computation. 

In addition to using CRNs in simulations for each gradient computation, we have found that using per¬ 
sistent random seeds helps HABC explore the parameter landscape more easily for some algorithms and 
problems. Intuitively, for a gradient-based sampling algorithm, it means a particle can slide along a smooth 
Hamiltonian landscape because the additive noise is suppressed. This is very similar to using dependent ran¬ 
dom streams to drive MCMC (Murray & Elliott, 2012; Neal, 2012), the main difference we believe is that 
we are using the Hamiltonian dynamics to drive proposals for 9 and using persistent seeds uj to suppress 
simulation noise. 

Using random seeds (versus, say, a set of random numbers) allows us to treat the simulator as a black¬ 
box, setting the random seed of its RNG without knowing the internal mechanisms it uses to generate random 
numbers. In light of our arguments above, we propose including persistent random seeds w in the state of 
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Figure 2: Variance of gradient estimation using kernel-e and SL for different values of 5 G {5, 50} and fixed e = 0.37 
(the same used in the other results). When S' = 5, the empirical estimates of Vt/(0MAp) are —12 ± 147 
(kemel-e) and -9.3 ± 43 (SL). When S = 50 they are -0.80 ± 19 (kemel-e) and -7.3 ± 4.9 (SL). Note the 
large discrepancy in variance. Note the limit of S —>■ oo, V[/(6 map) = —7.8. The bias if SL gradients is due 
to its Gaussian approximation (smoothed by e) of 7r(xj0), which is a heavy-tailed Gamma distribution (the 
sum of N exponentials). 


our Markov chain. We will now describe a simple Metropolis-Hastings transition operator that randomly 
proposes flipping each seed ojg at time t with some probability 7. 

This Metropolis-Hastings transition conditions of the current parameter location 6 and proposes changing 
a single random seed w (it easily generalizes to S seeds). The procedure is as follows; 1) propose a new seed 
uj ^ q(uj jw) = 7r(w) (independent of the current seed and from its uniform prior); 2) simulate determin- 
isticaiiy x = f{9, w ); 3) compute the acceptance ratio (which reduces to the ratio of 7r(y |x )/7r(y|x)). It 
is straightforward to show that this leaves the target distribution invariant. The probability of the proposal is 
q{x ,uj \6,uj) = Tr{uj )(5(x — f{9, to )), where S{a) is a delta function at a = 0. Because the q has this form, 
acceptance ratio simplifies: 

7re(y|x')7r(u;')7r(x'|g,a;>) Tr{uj)d{-K - f{9,uj)) ^ 7re(y|x') 

7^e(y|x)7r(a;)7r(x|0,u;) 7r(w')i5(x'-/(0, w')) 7re(y|x) 

In pseudo-marginal ABC-MCMC one could propose g(x (fixing 9) and still sample correctly from 
the distribution of simulations with high likelihood at 9. What we propose is slightly different. By instead 
keeping the random seeds fixed, we can sample 9 using HABC and use u) as CRNs within the gradient 
computation step and suppress gradient noise over time. In this way, random seeds carry over the same 
additive noise from one step to the next. 


5. Demonstration 

We use a simple D = 1 problem to demonstrate HABC. Let y = ^ 6*’ where ~ Exp(l/0*); 9* = 

0.15, N = 20, and y = 7.74 in our concrete example. Assuming Tr{9) = Gamma(a, /3), the true posterior is 
a gamma distribution with shape a + N and rate fl + Ny. Our simulator therefore generates the average of N 
exponential random variates with rate A = 1/0. Data x 7r(a;|0) are shown in Figure 1. We have explicitly 
shown the smoothness of the simulator by generating data along trajectories of fixed seeds Wg; i.e. for several 
Wg we vary 0 (dashed lines are function f{6,ujs)) and randomly reveal simulation data (blue circles). The 
horizontal line with shading indicates y ± 2e, where e = 0.37 is used throughout the demonstration. 
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Figure 3: Posterior distributions for the demonstration problem. Top row: No persistent seeds. Bottom row: Persistent 
seeds with 7 = 0.1. Histograms of the posterior estimates are overlaid with the true posterior (dashed line). 
All algorithms (except for SG-Thermostats for non-persistent at) give roughly the same posterior estimate. 
By adding persistent lj SG-Thermostats achieved similar posteriors to the other algorithms. 


5.1 Bias and Variance of VU{9) 

To test our assumption that the synthetic-likelihood model is better suited for HABC, we ran FDSA at the 
true 0MAP- Using S = h and S' = 50 and fixing e = 0.37, we gather lOK gradients samples using kernel-e 
and SL likelihoods. These gradient estimate densities are shown in Figure 2. An unbiased estimate of the 
gradient should be centered at 0. There are two important results. First, the SL estimates have a small bias, 
even at S = 50. This is because it is estimating the true Gamma distribution of 7r(x|0) with a Gaussian. We 
can analytically estimate this bias as S ^ 00; for this example it is —7.8 which is what SL estimates are 
centered around (—9.3 for S' = 5 and 7.3 for S = 50). The kernel-e likelihood, on the other hand, exhibits 
low bias at S = 50. However, the second important result is the variances. SL variances decrease quickly 
with S: cr^ = 43^ —> 4.9^, whereas kernel-e starts very high and remains high: tr^ = 147^ —> 19^. It is 
for this reason that we have chosen to use SL likelihoods for our gradient estimates, despite their small bias. 
As mentioned in Section 4.2 it is possible that other likelihood models, such as KDE, might provide low bias 
and low variance gradient estimates. We leave this for future work. 

5.2 Posterior Inference using HABC 

We ran chains of length 50K for SL-MCMC, SGLD, SGHMC, and SGNHT versions of HABC using SL 
gradient estimates (S = 5). A pseudo-marginal version of SL-MCMC was used. We note that SGHMC 
gave results nearly identical to SGNHT, so are not shown do to space limitations. In one set of experi¬ 
ments, common random seeds were used for gradient computations only, and did not persist over time steps; 
these experiments are called non-persistent. In another set of runs, we resampled ujg at each time step with 
probability 7 = 0.1; these experiments are persistent. In Figure 3 we show the posterior distributions for 
these experiments; in Table 1 we report the total variational distance between the true posterior and the 
ABC posteriors using the first IQK samples and after 50K samples (averaged over 5 chains). Of note is 
the poor approximation of SG-Thermostats when the seeds are not persistent. By adding persistent seeds, 
SG-Thermostats gives similar posteriors to the other methods. 
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Figure 4: Trajectories of the last 1000 6 samples for the demonstration problem. Left column: Non-persistent random 
seeds. Right column: Persistent random seeds with 7 = 0.1. Each algorithm’s parameters were optimized 
to minimize the total variational distance. With persistent seeds, each algorithm’s random walk behavior is 
suppressed. Without persistent seeds, the optimal step-size rj for SG-Thermostats is small, resulting in an 
under-dispersed estimate of the posterior; when the seeds are persistent, the gradients are more consistent, 
and the optimal step-size is larger and therefore there is larger injected noise. The resulting posteriors are 
shown in Figure 3. 


In Figure 4 we show the trace plots of the last 1000 samples from a single chain for each algorithm. 
In the left column, traces for non-persistent random seeds are shown, and on the right, traces for persistent 
seeds. We can observe that persistent random seeds further reduces the random walk behavior of all three 
methods. We also observe small improvements in total variational distance for SL-MCMC and SOLD, while 
SGNHT improves significantly. We find this a compelling mystery. Is it because of the interaction between 
hyperparameters and stochastic gradients, or is this an artifact of this simple model? 


6. Experiments 

We present experimental results comparing HABC with standard ABC-MCMC for two challenging simu¬ 
lators. The first is the blowfly model which uses stochastic differential equations to model possibly chaotic 
population dynamics (Wood, 2010). Although it is a low-dimensional problem, the noise and chaotic behav¬ 
ior of the model make it challenging for gradient-based sampling. Our second experiment applies HABC to 
a Bayesian logistic regression model. Although we only use 2 classes (O’s versus I’s), the dimensionality is 
very high {D = 1568). We show that HABC can work well despite using SPSA gradients. 
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Table 1: Average total variational distance (tvd) for the demonstration problem. Non-persistent used no 
persistent random seeds, whereas Persistent randomly proposes a new ujg with 7 = 0.1. Each 
algorithms’ parameters were optimized for minimal tvd after lOK samples. The results for SGHMC 
(not shown) and SGNHT are nearly identical. 



Non-persistent 

Persistent 

Algo 

lOK 

50K 

lOK 

50K 

SL-ABC 

0.047 

0.045 

0.045 

0.045 

SGFD 

0.049 

0.048 

0.048 

0.043 

SGNHT 

0.232 

0.239 

0.055 

0.051 


6.1 Blowfly 

For these experiments, a simulator of adult sheep blowfly populations (Wood, 2010) is used with statistics 
set to those from (Meeds & Welling, 2014). The observational vector y is a time-series of a fly population 
counted daily. The population dynamics are modeled using a stochastic differential equation' 

Nt+i = PNt-r exp{-Nt-r/No)et + Nt exp(-i5et) 

where et ~ Q{l/ap, l/cip) and ct ~ (/(l/cr^, 1/cr^) are sources of noise, and r is an integer. In total, there 
are I? = 6 parameters 6 = {log P, log S, log Nq, log ad, log ap, t}. As (Meeds & Welling, 2014) we place 
broad log-normal priors over 61 , ,,5 and a Poisson prior over r. This is considered a challenging problem 
because slight changes to some parameter settings can produce degenerate x, while others settings can be 
very noisy due to the chaotic nature of the equations. The statistics from (Meeds & Welling, 2014) are used 
(J = 10): the log average of 4 quantiles of W/1000, the average of 4 quantiles of the first-order differences 
in W/1000, and the number of maximal population peaks under two different thresholds. 

We compare difference HABC algorithms with ABC-MCMC for the blowfly population problem. We use 
e = {1/2,1/2,1/2,1/2,1/4,1/4,1/4,1/4,3/4,3/4} (slightly different e from (Meeds & Welling, 2014)) 
and S' = 10 for all experiments. We use SPSA with R = 2 using SL log-likelihoods for all HABC gradient 
estimates. Without persistent seeds, the number of simulations per time-step is 2SR (about double marginal 
ABC-MCMC) and with it is 2SR + 2 S 7 . 

Figure 5 show the posterior distributions for three parameters for SF-MCMC, SGFD, and SG-Thermostats 
using non-persistent seeds (persistent seeds, not shown, produced very similar posteriors). In the second row 
we show the trajectories of two parameters, clearly showing the suppressed random walk behavior of SGFD 
and SG-Thermostats relative to ABC-MCMC. In Figure 6 the scatter plots of trajectories are shown for two 
parameters. Though not shown due to space limitations, we have found that persistent seeds can improve 
convergence of the posterior predictive distribution. Further experiments with persistent seeds needs to be 
carried out to understand the extent to which the help and how to determine when they are necessary, if at all. 

6.2 Bayesian Logistic Regression 

We perform Bayesian inference on Bayesian inference on a logistic regression model using the digits 0 and 
1 from MNIST. Despite its simplicity, the model still represents a high-dimensional problem for HABC 
(D = 1568). We first ran stochastic gradient descent to determine 0 map using the true gradient. We then run 
HABC SGFD and SG-Thermostats starting 0 map to discover how well the algorithms explore the posterior. 
We compare with SGFD and SG-Thermostats using the true gradients. We use n = 100 size mini-batches and 
i? = 10 number of perturbations for SPSA. Figure 7 shows samples randomly projected onto 2 dimensions 
(1000 evenly sub-sampled from lOK). We can see that the trajectories using SPSA exhibit similar behavior 

1. Equation 1 in Section 1.2.3 of the supplementary information in (Wood, 2010). 
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Figure 5; Blowfly posterior distributions (non-persistent seeds). Left column: Posteriors for three parameters for SL- 
MCMC (top row), SOLD (middle), and SG-Thermostats (bottom). Right column: Trajectories of the last 
1000 samples for two parameters. 
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Figure 6: Blowfly trajectories of two parameters over the last 1000 time-steps. Left: SL-MCMC, Middle: SGLD 
and Right: SG-Thermostats. Relative to SL-MCMC, the Hamiltonian dynamics clearly show persistent G 
trajectories. 


to Bayesian learning with the true gradients. This is a positive result that indicates HABC can successfully 
exploit the noisy and less informative gradients of SPSA. 
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Logistic Regression Sampling Trajectories 
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Figure 7: Bayesian logistic regression sampling trajectories randomly projected. The yellow circle is the projected 
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7. DISCUSSION AND CONCLUSION 

Hamiltonian ABC proposes a new set of algorithms for Bayesian inference of likelihood-free models. HABC 
builds upon the connections between Hamiltonian Monte Carlo with stochastic gradients and well-established 
gradient approximations based on a minimal number of forward simulations, even in high-dimensions. We 
have performed some preliminary experiments showing the feasibility of running ABC on both small and 
large problems, and we hope that the door has been opened for exploration of larger simulation-based models 
using HABC. 

Another innovation we introduce is the use of persistent random seeds to suppress the simulator noise and 
therefore smooth the simulation landscape over a local region of parameter space. For some algorithms run 
on certain models, improved performance has been observed. This is most likely to be the case for simulators 
with large additive noise and algorithms that benefit from long Hamiltonian trajectories (i.e. SGHMC and 
SG-Thermostats). We feel that new classes of ABC algorithms could develop from using persistent random 
seeds, not just gradient-based samplers but traditional ABC-MCMC. 

There are several unresolved and open questions regarding the application of stochastic gradients to ABC. 
The first issue is the importance of the bias-variance relationship for different ABC likelihood models. We 
found that using gradients based on the synthetic-likelihood greatly reduced their variance, but introduced a 
small bias, because of its Gaussian assumption. The second issue is setting algorithm parameters, in particular 
the step-sizes rj, the injected noise C (for SGHMC/SGNHT), and the number of SPSA repetitions R. All of 
these parameters are highly interactive. Can we use statistical tests during the MCMC run to determine R1 
Should r] and C be set differently in the ABC setting? One final issue is monitoring or determining whether 
the correct amount of noise is being injected to ensure proper sampling. In SGLD (Welling & Teh, 2011), for 
example, we can always turn down rj so that the injected noise term dominates, but when our goal is efficient 
exploration of the posterior, this is not a very satisfying solution. 

Expensive simulators are an important class of models that we do not address in this work. However, 
previous work in Bayesian inference has shown the usefulness of HMC-based proposals based on Gaussian 
process of log-likelihood surfaces (Rasmussen, 2003). We could similarly use HABC with ABC surrogate 
models (Meeds & Welling, 2014; Wilkinson, 2014) to minimize simulation calls, yet still benefit from Hamil¬ 
tonian dynamics. 
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